Spark 3.2: Fix predicate pushdown in row-level operations #4023

aokolnychyi merged 1 commit into apache:master
Conversation
The branch was force-pushed from 76317e5 to 383d2d6.
```diff
-    val (scan, output) = PushDownUtils.pruneColumns(
-      scanBuilder, relation, relation.output, Seq.empty)
+    val (scan, output) = PushDownUtils.pruneColumns(scanBuilder, relation, relation.output, Nil)
```
Nit: is the change from `Seq.empty` to `Nil` necessary?
I changed it so that it fits on one line, just like the new filter pushdown logic.
Hi @aokolnychyi @szehon-ho, is there any reason for not passing `pushedFilters` here instead of `Nil`?
actually nvm, this is only to prune columns.
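For background, column pruning and filter pushdown are independent capabilities on a DSv2 `ScanBuilder`, which is why this call needs no filters. A minimal sketch of that separation (`configureScan` is a hypothetical helper, not Iceberg or Spark code):

```scala
import org.apache.spark.sql.connector.read.{ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical helper: the two pushdown concerns use separate mix-in interfaces.
def configureScan(builder: ScanBuilder, requiredSchema: StructType, filters: Array[Filter]): Unit = {
  builder match {
    case fb: SupportsPushDownFilters =>
      // pushFilters returns the filters the source could not fully handle
      val postScanFilters = fb.pushFilters(filters)
      println(s"post-scan filters: ${postScanFilters.mkString(", ")}")
    case _ => // source does not support filter pushdown
  }
  builder match {
    case cb: SupportsPushDownRequiredColumns =>
      cb.pruneColumns(requiredSchema) // column pruning needs no filter information
    case _ => // source does not support column pruning
  }
}
```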
singhpk234 left a comment:
Looks good to me. Thanks @aokolnychyi!!!
```scala
    tableAttrs: Seq[AttributeReference]): (Seq[Filter], Seq[Expression]) = {

  val tableAttrSet = AttributeSet(tableAttrs)
  val filters = splitConjunctivePredicates(cond).filter(_.references.subsetOf(tableAttrSet))
```
Was this what was preventing pushdown before? We weren't filtering out expressions that referenced columns outside of the table?
Yes, previously we did not split the condition into parts, and we did not remove filters that referenced both tables.
@szehon-ho ^^
This comment provides a bit more info to answer your question above.
We treated `t.id = s.id AND t.dep IN ('hr')` as a single predicate that couldn't be converted because it referenced both tables. Instead, we now split it into parts and convert whatever we can (i.e. `t.dep IN ('hr')` in this case); see the sketch below.
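To make that concrete, here is a minimal sketch of the splitting logic (`splitConjuncts` and `convertibleConjuncts` are hypothetical names mirroring Catalyst's `splitConjunctivePredicates`, not the exact Iceberg code):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, AttributeSet, Expression}

// Break the condition on ANDs into individual conjuncts.
def splitConjuncts(cond: Expression): Seq[Expression] = cond match {
  case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
  case other => Seq(other)
}

// Keep only the conjuncts whose references fall entirely within the
// target table's attributes; those are the ones that can be pushed down.
def convertibleConjuncts(cond: Expression, tableAttrs: AttributeSet): Seq[Expression] =
  splitConjuncts(cond).filter(_.references.subsetOf(tableAttrs))

// For t.id = s.id AND t.dep IN ('hr'), only t.dep IN ('hr') survives the
// filter, so it alone is converted and pushed to the scan.
```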
```java
Snapshot mergeSnapshot = table.currentSnapshot();
String deletedDataFilesCount = mergeSnapshot.summary().get(SnapshotSummary.DELETED_FILES_PROP);
Assert.assertEquals("Must overwrite only 1 file", "1", deletedDataFilesCount);
```
Other tests use the listener to check the expressions that were pushed down directly. Should we do that in this test?
I think I missed that. Could you point me to an example?
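For illustration, one way such a listener-based check can look in Spark (a sketch with hypothetical table names, assuming the executed plan's string form reports the scan's pushed filters; this is not the Iceberg test helper):

```scala
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Sketch: capture executed plans, then assert on the pushed filters.
class PlanCapturingListener extends QueryExecutionListener {
  val plans = mutable.Buffer.empty[QueryExecution]
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    plans += qe
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

val spark = SparkSession.active
val listener = new PlanCapturingListener
spark.listenerManager.register(listener)
spark.sql(
  "MERGE INTO target t USING source s " +
  "ON t.id = s.id AND t.dep IN ('hr') " +
  "WHEN MATCHED THEN UPDATE SET t.dep = s.dep")
// Listener callbacks are asynchronous; real tests wait for the listener bus
// to drain before asserting.
assert(listener.plans.exists(_.executedPlan.toString.contains("dep IN ('hr')")))
spark.listenerManager.unregister(listener)
```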
Thanks for reviewing, @singhpk234 @szehon-ho @rdblue!
(cherry picked from commit 5d599e1)
This PR fixes predicate pushdown in row-level operations in Spark 3.2. Previously, we would not extract filters, so MERGE conditions such as `t.id = s.id AND t.dep IN ('hr')` would not be pushed down.
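An illustrative end-to-end example (hypothetical target/source tables) of the kind of MERGE whose condition now gets partially pushed down:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.active

// Only t.dep IN ('hr') references the target table alone, so after this fix
// it is converted and pushed to the Iceberg scan; t.id = s.id still has to
// be evaluated as part of the join.
spark.sql(
  """MERGE INTO target t
    |USING source s
    |ON t.id = s.id AND t.dep IN ('hr')
    |WHEN MATCHED THEN UPDATE SET t.dep = s.dep
    |""".stripMargin)
```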